PILIPINAS IN A NUTSHELL (PINUT 2023)

"HIV AIDS is a disease with stigma. And we have learned with experience, not just with HIV AIDS but with other diseases countries for many reasons are sometimes hesitant to admit they have a problem."

– Margaret Chan

I. INTRODUCTION

It's time to look at the facts. 12,026 entries. All women. What you are about to observe is the state of HIV in the Philippines in 2022.

This notebook will be divided into three parts. Data Cleaning. Data Exploration. Data Visualization. We will provide a walkthrough on how we were able to filter, analyze, and report on the dataset that we have obtained for this project.

II. DATA CLEANING

IMPORT PACKAGES

The main tool in this endeavor will be Pandas, a Data Analysis library for the Python Programming Language (we use version 3.11). Before anything else, let us import the packages we need.

In [ ]:
# Pandas Library (Data Analysis)
import pandas as pd

# Numpy and Scipy Libraries (Numerical and Scientific Computing)
import numpy as np
import scipy as sp

# Sklearn Library (Machine Learning)
import sklearn
from sklearn import *

# Plotly Library (Data Visualization)
import plotly.io as pio
import plotly.subplots as ps
import plotly.graph_objects as go
import plotly.express as px

# For Displaying Dataframes in Jupyter Notebooks
from IPython.display import display

Make sure to install the packages first if you haven't already.
Here are the commands for each library as of 2024 (type these in your Terminal):

  1. Pandas: pip install pandas
  2. Numpy: pip install numpy
  3. Scipy: pip install scipy
  4. Sklearn: pip install scikit-learn
  5. Plotly: pip install plotly
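
If you are unsure whether everything is already installed, a quick check from Python can save a failed import later. The following is only a minimal sketch: the helper names installed_version and report_packages are our own for illustration, and the package names are the PyPI names from the list above (note that Sklearn is published as scikit-learn).

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI names of the packages listed above (scikit-learn, not sklearn)
REQUIRED_PACKAGES = ["pandas", "numpy", "scipy", "scikit-learn", "plotly"]


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def report_packages(package_names):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in package_names}


for name, found in report_packages(REQUIRED_PACKAGES).items():
    print(f"{name}: {found if found else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can then be installed with the matching pip command.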

READ THE DATASET FILES

It is now time to get our dataset. The .csv files themselves can be found in our GitHub Repository Page. We can use the read_csv() function to load the files and store them as dataframes.

In [ ]:
# Store the File Paths into Variables
dataset_Codes_FilePath = "dataset_Codes.csv"
dataset_Proper_FilePath = "dataset_Proper.csv"

# Read the Files from said Variables
dataset_Codes_DataFrame = pd.read_csv(dataset_Codes_FilePath)
dataset_Proper_DataFrame = pd.read_csv(dataset_Proper_FilePath)

# Create a Copy of the Dataframes
dataset_Codes_Copy = dataset_Codes_DataFrame.copy()
dataset_Proper_Copy = dataset_Proper_DataFrame.copy()

# Optional Display Settings
pd.set_option("display.max_columns", 5)
pd.set_option("display.max_rows", None)
pd.set_option("future.no_silent_downcasting", True)
pio.renderers.default = "notebook"
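
A side note on the display settings: set_option() changes them for the whole session. When a wider view is only needed once, pandas also offers option_context(), which applies a setting inside a with block and restores the previous value afterwards. Here is a small sketch on a made-up dataframe (wide_DataFrame and its column names are placeholders, not part of our dataset).

```python
import pandas as pd

# A small stand-in dataframe with more columns than the global limit of 5
wide_DataFrame = pd.DataFrame([[0] * 8], columns=[f"V{i:03d}" for i in range(8)])

pd.set_option("display.max_columns", 5)

# Inside the block the column limit is lifted, so every column is printed
with pd.option_context("display.max_columns", None):
    full_Text = repr(wide_DataFrame)
    inside_Limit = pd.get_option("display.max_columns")

# Outside the block the global limit of 5 applies again
outside_Limit = pd.get_option("display.max_columns")

print(full_Text)
print(inside_Limit, outside_Limit)
```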

WHY TWO FILES?

Here's why. Let's take a peek into our two dataframes so far.

In [ ]:
# These are the first five entries of the "Codes" Dataframe
dataset_Codes_Copy.head()
Out[ ]:
CASEID Case Identification
0 V000 Country code and phase
1 V001 Cluster number
2 V002 Household number
3 V003 Respondent's line number
4 V004 Ultimate area unit
In [ ]:
# These are the first five entries of the "Proper" Dataframe
dataset_Proper_Copy.head()
Out[ ]:
V000 V001 ... V867D V867E
0 PH8 1 ...
1 PH8 1 ...
2 PH8 1 ...
3 PH8 1 ...
4 PH8 1 ...

5 rows × 232 columns

Notice anything? Pay close attention to the entries of the CASEID Column from the Codes Dataframe.
Now, pay close attention to the column names of the Proper Dataframe.

Indeed. The Codes Dataframe gets its name because it maps each column code in the Proper Dataframe to the actual survey question that was asked. If we started working on the Proper Dataframe right away, however, we would have a hard time telling which question each column name corresponds to. We would have to tediously check both files, back and forth.

Let's make our lives easier. Let's rename all the columns of the Proper Dataframe. Sure, the column names will be long (for now), but at least we would have an idea of what question was asked without having to check on both files at the same time.

To do this, let us create a Dictionary with the list of Codes and their corresponding Survey Questions as the Key-Value pairs. Afterwards, let us use the rename() function in order to modify the column names of the Proper Dataframe.

In [ ]:
# Create a Dictionary of the Codes mapped to their respective Column Names
dataset_Codes_Dictionary = dict(zip(dataset_Codes_Copy["CASEID"], dataset_Codes_Copy["Case Identification"]))

# Rename the Columns of the "Proper" Dataframe
dataset_Proper_Copy.rename(columns = dataset_Codes_Dictionary, inplace = True)

# Let us view our changes
dataset_Proper_Copy.head()
Out[ ]:
Country code and phase Cluster number ... Things happened because HIV positive status: healthcare workers talked badly Things happened because HIV positive status: healthcare workers verbally abused
0 PH8 1 ...
1 PH8 1 ...
2 PH8 1 ...
3 PH8 1 ...
4 PH8 1 ...

5 rows × 232 columns
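
One thing to keep in mind: rename() leaves a column untouched, without any error, when its name is not a key in the dictionary. A quick check before renaming can confirm that no code was missed. The sketch below uses toy stand-ins for the two files; the code V999 and the helper find_Unmapped_Columns are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two files (the real "Proper" file has 232 columns)
codes_Toy = pd.DataFrame({
    "CASEID": ["V000", "V001"],
    "Case Identification": ["Country code and phase", "Cluster number"],
})
# "V999" is a made-up code with no entry in the toy "Codes" file
proper_Toy = pd.DataFrame({"V000": ["PH8"], "V001": [1], "V999": [0]})

codes_Toy_Dictionary = dict(zip(codes_Toy["CASEID"], codes_Toy["Case Identification"]))


def find_Unmapped_Columns(dataframe, mapping):
    """List the columns that have no entry in the code-to-question mapping."""
    return [col for col in dataframe.columns if col not in mapping]


# Run the check BEFORE renaming, while the columns still carry their codes
unmapped_Columns = find_Unmapped_Columns(proper_Toy, codes_Toy_Dictionary)
renamed_Toy = proper_Toy.rename(columns=codes_Toy_Dictionary)

print(unmapped_Columns)
print(list(renamed_Toy.columns))
```

An empty list from the check means every column received a readable name.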

COMPLETE OR DELETE MISSING VALUES

There are several missing values in the dataset. Don't believe me? Look at this then.

WARNING: For demonstration purposes only, I will be printing ALL of the original columns from the Proper Dataframe. Don't worry, we will be trimming this significantly as we go on.

For the meantime, kindly brace for a lengthy scroll ahead.

In [ ]:
def dataFrame_Print_Missing_Info(dataframe: pd.DataFrame):
    # Create a new Dataframe to hold our Missing Data Info
    nullCount_dataFrame = pd.DataFrame(index = dataframe.columns)

    # Count the Number and Percent of Missing Cells for each Column
    # (Object Columns: cells that are Null or hold only an Empty Space; other Columns: Null cells)
    nullCount_dataFrame.loc[:, "DataType"] = dataframe.dtypes
    nullCount_dataFrame.loc[:, "TotalCount"] = dataframe.shape[0]
    nullCount_dataFrame.loc[:, "NullCount"] = dataframe.apply(lambda x: (x.isnull() | x.eq(" ")).sum() if x.dtype == "object" else x.isnull().sum())
    nullCount_dataFrame.loc[:, "NullPercent"] = (((nullCount_dataFrame["NullCount"] / nullCount_dataFrame["TotalCount"]) * 100)).round(2)

    # Display the Column Datatypes and the Info on Missing Data
    display(nullCount_dataFrame)

dataFrame_Print_Missing_Info(dataset_Proper_Copy)
DataType TotalCount NullCount NullPercent
Country code and phase object 27821 0 0.00
Cluster number int64 27821 0 0.00
Household number int64 27821 0 0.00
Respondent's line number int64 27821 0 0.00
Ultimate area unit int64 27821 0 0.00
Women's individual sample weight (6 decimals) int64 27821 0 0.00
Month of interview int64 27821 0 0.00
Year of interview int64 27821 0 0.00
Date of interview (CMC) int64 27821 0 0.00
Date of interview Century Day Code (CDC) int64 27821 0 0.00
Respondent's month of birth int64 27821 0 0.00
Respondent's year of birth int64 27821 0 0.00
Date of birth (CMC) int64 27821 0 0.00
Respondent's current age int64 27821 0 0.00
Age in 5-year groups int64 27821 0 0.00
Completeness of age information int64 27821 0 0.00
Result of individual interview int64 27821 0 0.00
Day of interview int64 27821 0 0.00
CMC start of calendar int64 27821 0 0.00
Row of month of interview int64 27821 0 0.00
Length of calendar int64 27821 0 0.00
Number of calendar columns int64 27821 0 0.00
Ever-married sample int64 27821 0 0.00
Primary sampling unit int64 27821 0 0.00
Sample strata for sampling errors int64 27821 0 0.00
Stratification used in sample design int64 27821 0 0.00
Region int64 27821 0 0.00
Type of place of residence int64 27821 0 0.00
NA - De facto place of residence object 27821 27821 100.00
Number of visits int64 27821 0 0.00
Interviewer identification int64 27821 0 0.00
NA - Keyer identification object 27821 27821 100.00
Field supervisor int64 27821 0 0.00
NA - Field editor object 27821 27821 100.00
NA - Office editor object 27821 27821 100.00
Line number of husband object 27821 13373 48.07
Cluster altitude in meters int64 27821 0 0.00
Household selected for hemoglobin int64 27821 0 0.00
Selected for Domestic Violence module int64 27821 0 0.00
Language of questionnaire int64 27821 0 0.00
Language of interview int64 27821 0 0.00
Native language of respondent int64 27821 0 0.00
Translator used int64 27821 0 0.00
Team number int64 27821 0 0.00
Team supervisor int64 27821 0 0.00
Region int64 27821 0 0.00
Type of place of residence int64 27821 0 0.00
NA - Childhood place of residence object 27821 27821 100.00
Years lived in place of residence int64 27821 0 0.00
Type of place of previous residence object 27821 16458 59.16
Region of previous residence object 27821 16458 59.16
Highest educational level int64 27821 0 0.00
Highest year of education object 27821 282 1.01
Ever heard of a Sexually Transmitted Infection (STI) int64 27821 0 0.00
Ever heard of AIDS int64 27821 0 0.00
NA - Reduce risk of getting HIV: do not have sex at all object 27821 27821 100.00
Reduce risk of getting HIV: always use condoms during sex object 27821 18636 66.99
Reduce risk of getting HIV: have 1 sex partner only, who has no other partners object 27821 18636 66.99
Can get HIV from mosquito bites object 27821 18636 66.99
Can get HIV by sharing food with person who has AIDS object 27821 18636 66.99
A healthy looking person can have HIV object 27821 18636 66.99
Condom used during last sex with most recent partner object 27821 11905 42.79
Condom used during last sex with 2nd to most recent partner object 27821 27740 99.71
Condom used during last sex with 3rd to most recent partner object 27821 27807 99.95
Source of condoms used for last sex object 27821 27341 98.27
Brand of condom used for last sex object 27821 27341 98.27
Had any STI in last 12 months int64 27821 0 0.00
Had genital sore/ulcer in last 12 months int64 27821 0 0.00
Had genital discharge in last 12 months int64 27821 0 0.00
NA - Had CS STI in last 12 months object 27821 27821 100.00
NA - Had CS STI in last 12 months object 27821 27821 100.00
NA - Had CS STI in last 12 months object 27821 27821 100.00
NA - Had CS STI in last 12 months object 27821 27821 100.00
Number of sex partners, excluding spouse, in last 12 months int64 27821 0 0.00
Number of sex partners, including spouse, in last 12 months int64 27821 0 0.00
Relationship with most recent sex partner object 27821 11905 42.79
Relationship with 2nd to most recent sex partner object 27821 27740 99.71
Relationship with 3rd to most recent sex partner object 27821 27807 99.95
NA - Length of time knows last partner object 27821 27821 100.00
NA - Length of time knows other partner (1) object 27821 27821 100.00
NA - Length of time knows other partner (2) object 27821 27821 100.00
NA - Sought advice/treatment for last STI infection object 27821 27821 100.00
NA - Sought STI advice/treatment from: government hospital object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS public object 27821 27821 100.00
NA - Sought STI advice/treatment from: private hospital/clinic/doctor object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS private object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS other object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS other object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS other object 27821 27821 100.00
NA - Sought STI advice/treatment from: CS other object 27821 27821 100.00
NA - Sought STI advice/treatment from: other object 27821 27821 100.00
NA - HIV transmitted during pregnancy object 27821 27821 100.00
NA - HIV transmitted during delivery object 27821 27821 100.00
NA - HIV transmitted by breastfeeding object 27821 27821 100.00
NA - Knows someone who has, or is suspected of having, HIV object 27821 27821 100.00
NA - Would want HIV infection in family to remain secret object 27821 27821 100.00
NA - Would be ashamed if someone in the family had HIV object 27821 27821 100.00
NA - Willing to care for relative with AIDS object 27821 27821 100.00
NA - A female teacher infected with HIV, but is not sick, should be allowed to continue teaching object 27821 27821 100.00
NA - Children should be taught about condoms to avoid AIDS object 27821 27821 100.00
Ever been tested for HIV int64 27821 0 0.00
NA - Know a place to get HIV test object 27821 27821 100.00
NA - Place for HIV test: government hospital object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: CS public object 27821 27821 100.00
NA - Place for HIV test: private hospital/clinic/doctor object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS private object 27821 27821 100.00
NA - Place for HIV test: CS other object 27821 27821 100.00
NA - Place for HIV test: CS other object 27821 27821 100.00
NA - Place for HIV test: CS other object 27821 27821 100.00
NA - Place for HIV test: other object 27821 27821 100.00
Heard about other STIs int64 27821 0 0.00
NA - Last 12 months had sex in return for gifts, cash or other object 27821 27821 100.00
Used a method at last sexual intercourse object 27821 12805 46.03
Method used last sexual intercourse: female sterilization object 27821 19398 69.72
Method used last sexual intercourse: male sterilization object 27821 19398 69.72
Method used last sexual intercourse: IUD object 27821 19398 69.72
Method used last sexual intercourse: injectables object 27821 19398 69.72
Method used last sexual intercourse: implants object 27821 19398 69.72
Method used last sexual intercourse: pill object 27821 19398 69.72
Method used last sexual intercourse: condom object 27821 19398 69.72
Method used last sexual intercourse: female condom object 27821 19398 69.72
Method used last sexual intercourse: emergency contraception object 27821 19398 69.72
Method used last sexual intercourse: standard days method object 27821 19398 69.72
Method used last sexual intercourse: lactational amenorrhea object 27821 19398 69.72
Method used last sexual intercourse: rhythm object 27821 19398 69.72
Method used last sexual intercourse: withdrawal object 27821 19398 69.72
Method used last sexual intercourse: patch object 27821 19398 69.72
Method used last sexual intercourse: mucus/billings/ovulation object 27821 19398 69.72
NA - Method used last sexual intercourse: CS object 27821 27821 100.00
Method used last sexual intercourse: other modern object 27821 19398 69.72
Method used last sexual intercourse: other traditional object 27821 19398 69.72
NA - Condom used at first sex object 27821 27821 100.00
NA - Most recent sex partner younger, the same age or older object 27821 27821 100.00
NA - 2nd to most recent sex partner younger, the same age or older object 27821 27821 100.00
NA - 3rd to most recent sex partner younger, the same age or older object 27821 27821 100.00
Wife justified asking husband to use condom if he has STI int64 27821 0 0.00
NA - Can get HIV by witchcraft or supernatural means object 27821 27821 100.00
Drugs to avoid HIV transmission to baby during pregnancy object 27821 2537 9.12
Would buy vegetables from vendor with HIV object 27821 2537 9.12
NA - Last time tested for HIV object 27821 27821 100.00
Months ago most recent HIV test object 27821 25577 91.93
NA - Last HIV test: on your own, offered or required object 27821 27821 100.00
Received result from last HIV test object 27821 25577 91.93
Place where last HIV test was taken object 27821 25577 91.93
NA - Age of first sex partner object 27821 27821 100.00
NA - First sex partner younger, same age or older object 27821 27821 100.00
NA - Time since last sex with 2nd to most recent partner object 27821 27821 100.00
NA - Time since last sex with 3rd to most recent partner object 27821 27821 100.00
NA - Used condom every time had sex with most recent partner in last 12 months object 27821 27821 100.00
NA - Used condom every time had sex with 2nd to most recent partner in last 12 months object 27821 27821 100.00
NA - Used condom every time had sex with 3rd to most recent partner in last 12 months object 27821 27821 100.00
NA - Age of most recent partner object 27821 27821 100.00
NA - Age of 2nd to most recent partner object 27821 27821 100.00
NA - Age of 3rd to most recent partner object 27821 27821 100.00
NA - Alcohol consumption at last sex with most recent partner object 27821 27821 100.00
NA - Alcohol consumption at last sex with 2nd to most recent partner object 27821 27821 100.00
NA - Alcohol consumption at last sex with 3rd to most recent partner object 27821 27821 100.00
Total lifetime number of sex partners object 27821 9425 33.88
Heard of ARVs to treat HIV object 27821 2537 9.12
NA - During antenatal visit talked about: HIV transmitted mother to child object 27821 27821 100.00
NA - During antenatal visit talked about: things to do to prevent getting HIV object 27821 27821 100.00
NA - During antenatal visit talked about: getting tested for HIV object 27821 27821 100.00
NA - Offered HIV test as part of antenatal visit object 27821 27821 100.00
NA - Offered HIV test between the time went for delivery and before baby was born object 27821 27821 100.00
NA - Tested for HIV as part of antenatal visit object 27821 27821 100.00
NA - Tested for HIV between the time went for delivery and before baby was born object 27821 27821 100.00
NA - Got results of HIV test as part of antenatal visit object 27821 27821 100.00
NA - Got results of HIV test when tested before baby was born object 27821 27821 100.00
NA - Place were HIV test was taken as part of antenatal visit object 27821 27821 100.00
NA - Tested for HIV since antenatal visit test object 27821 27821 100.00
Respondent can refuse sex object 27821 12299 44.21
Respondent can ask partner to use a condom object 27821 12299 44.21
NA - How long ago first had sex with most recent partner object 27821 27821 100.00
NA - How long ago first had sex with 2nd most recent partner object 27821 27821 100.00
NA - How long ago first had sex with 3rd most recent partner object 27821 27821 100.00
NA - Times in last 12 months had sex with most recent partner object 27821 27821 100.00
NA - Times in last 12 months had sex with 2nd most recent partner object 27821 27821 100.00
NA - Times in last 12 months had sex with 3rd most recent partner object 27821 27821 100.00
NA - Received counseling after tested for AIDS during antenatal care object 27821 27821 100.00
Knowledge and use of HIV test kits object 27821 2537 9.12
Children with HIV should be allowed to attend school with children without HIV object 27821 2537 9.12
NA - People hesitate to take HIV test because reaction of other people if positive object 27821 27821 100.00
NA - People talk badly about people with or believed to have HIV object 27821 27821 100.00
NA - People with or believed to have HIV lose respect from other people object 27821 27821 100.00
NA - Would be afraid to get HIV from contact with saliva from infected person object 27821 27821 100.00
NA - Knowledge and attitude to PrEP to prevent getting HIV object 27821 27821 100.00
Month of most recent HIV test object 27821 25577 91.93
Year of most recent HIV test object 27821 25577 91.93
Date of most recent HIV test (CMC) object 27821 25577 91.93
Result of HIV test object 27821 25738 92.51
Month received first HIV test positive object 27821 27820 100.00
Year received first HIV test positive object 27821 27820 100.00
Date received first HIV test positive (CMC) object 27821 27820 100.00
Currently taking ARVs object 27821 27820 100.00
Number of HIV tests object 27821 25577 91.93
Disclosed HIV status to others object 27821 27820 100.00
Respondent feels ashamed of HIV status object 27821 27820 100.00
Things happened because HIV positive status: people talk badly object 27821 27820 100.00
Things happened because HIV positive status: someone else disclosed status object 27821 27820 100.00
Things happened because HIV positive status: verbally insulted/harassed/threatened object 27821 27820 100.00
Things happened because HIV positive status: healthcare workers talked badly object 27821 27820 100.00
Things happened because HIV positive status: healthcare workers verbally abused object 27821 27820 100.00

Okay. Let's take this step by step. What insights can we gain from this?

  1. We have 232 survey questions. As you can clearly see, that is a lot. However, we won't actually be needing all of them. I will explain why later.

  2. There are only two data types in the entire dataset: 64-Bit Integers and String Objects. Let's deal with cleaning the Objects first before touching the Integers.

  3. As suspected, there are several blank answers. All of these are actually strings with an empty space. These are what we need to filter out next.

  4. Some columns don't even have a single entry at all. This tells us something important: not every column should be treated equally. There are apparently columns that provide no value to us in this Project.
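
The third point is worth seeing on a small example, because it explains why the usual missing-value tools report nothing for these columns. The sketch below uses a toy answer column (answers_Toy is made up for illustration): pandas does not treat a single space as missing until it is replaced with NaN.

```python
import numpy as np
import pandas as pd

# A toy answer column where one respondent left the question blank
answers_Toy = pd.Series(["1", " ", "0", "1"], dtype=object)

# The blank is stored as a one-space string, so isnull() does not catch it
blank_Count = int(answers_Toy.eq(" ").sum())
null_Count_Before = int(answers_Toy.isnull().sum())

# After the replacement, the blank becomes a proper missing value
answers_Clean = answers_Toy.replace(" ", np.nan)
null_Count_After = int(answers_Clean.isnull().sum())

print(blank_Count, null_Count_Before, null_Count_After)
```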

In order for us to remove the missing entries, let's first replace all the empty spaces with NumPy NaN (Not a Number) values instead. This way, we can then simply use the dropna() function in order for us to delete these Null data points.

Here are our conditions. If more than half the data of a column is missing, we remove the entire column from our dataframe. Afterwards, any respondent that still has at least one missing answer will also get removed from our dataframe. For the purposes of the Project, both of these are much easier than manually filling in the missing data points (though imputing them is also a possibility). The trade-off is a smaller sample: as the output below shows, 12026 of the original 27821 respondents remain. Let us see what we have accomplished so far.

In [ ]:
# Replace all Empty Spaces with NaN Values
dataset_Proper_Copy.replace(" ", np.nan, inplace = True)

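# Drop all Survey Questions with More than Half their Data Missing
# (thresh is the minimum number of non-missing answers a column needs to be kept: 27821 * 0.5 = 13910.5, rounded up to 13911)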
dataset_Proper_Copy.dropna(axis = 1, thresh = 13911, inplace = True)

# Drop all Rows with at least One Missing Value
dataset_Proper_Copy.dropna(axis = 0, how = "any", inplace = True)

# Display the Newly Filtered Dataframe
dataFrame_Print_Missing_Info(dataset_Proper_Copy)
DataType TotalCount NullCount NullPercent
Country code and phase object 12026 0 0.0
Cluster number int64 12026 0 0.0
Household number int64 12026 0 0.0
Respondent's line number int64 12026 0 0.0
Ultimate area unit int64 12026 0 0.0
Women's individual sample weight (6 decimals) int64 12026 0 0.0
Month of interview int64 12026 0 0.0
Year of interview int64 12026 0 0.0
Date of interview (CMC) int64 12026 0 0.0
Date of interview Century Day Code (CDC) int64 12026 0 0.0
Respondent's month of birth int64 12026 0 0.0
Respondent's year of birth int64 12026 0 0.0
Date of birth (CMC) int64 12026 0 0.0
Respondent's current age int64 12026 0 0.0
Age in 5-year groups int64 12026 0 0.0
Completeness of age information int64 12026 0 0.0
Result of individual interview int64 12026 0 0.0
Day of interview int64 12026 0 0.0
CMC start of calendar int64 12026 0 0.0
Row of month of interview int64 12026 0 0.0
Length of calendar int64 12026 0 0.0
Number of calendar columns int64 12026 0 0.0
Ever-married sample int64 12026 0 0.0
Primary sampling unit int64 12026 0 0.0
Sample strata for sampling errors int64 12026 0 0.0
Stratification used in sample design int64 12026 0 0.0
Region int64 12026 0 0.0
Type of place of residence int64 12026 0 0.0
Number of visits int64 12026 0 0.0
Interviewer identification int64 12026 0 0.0
Field supervisor int64 12026 0 0.0
Line number of husband object 12026 0 0.0
Cluster altitude in meters int64 12026 0 0.0
Household selected for hemoglobin int64 12026 0 0.0
Selected for Domestic Violence module int64 12026 0 0.0
Language of questionnaire int64 12026 0 0.0
Language of interview int64 12026 0 0.0
Native language of respondent int64 12026 0 0.0
Translator used int64 12026 0 0.0
Team number int64 12026 0 0.0
Team supervisor int64 12026 0 0.0
Region int64 12026 0 0.0
Type of place of residence int64 12026 0 0.0
Years lived in place of residence int64 12026 0 0.0
Highest educational level int64 12026 0 0.0
Highest year of education object 12026 0 0.0
Ever heard of a Sexually Transmitted Infection (STI) int64 12026 0 0.0
Ever heard of AIDS int64 12026 0 0.0
Condom used during last sex with most recent partner object 12026 0 0.0
Had any STI in last 12 months int64 12026 0 0.0
Had genital sore/ulcer in last 12 months int64 12026 0 0.0
Had genital discharge in last 12 months int64 12026 0 0.0
Number of sex partners, excluding spouse, in last 12 months int64 12026 0 0.0
Number of sex partners, including spouse, in last 12 months int64 12026 0 0.0
Relationship with most recent sex partner object 12026 0 0.0
Ever been tested for HIV int64 12026 0 0.0
Heard about other STIs int64 12026 0 0.0
Used a method at last sexual intercourse object 12026 0 0.0
Wife justified asking husband to use condom if he has STI int64 12026 0 0.0
Drugs to avoid HIV transmission to baby during pregnancy object 12026 0 0.0
Would buy vegetables from vendor with HIV object 12026 0 0.0
Total lifetime number of sex partners object 12026 0 0.0
Heard of ARVs to treat HIV object 12026 0 0.0
Respondent can refuse sex object 12026 0 0.0
Respondent can ask partner to use a condom object 12026 0 0.0
Knowledge and use of HIV test kits object 12026 0 0.0
Children with HIV should be allowed to attend school with children without HIV object 12026 0 0.0
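
The thresh argument is easy to misread: it is the minimum number of non-missing values a column needs in order to be kept, not the number of missing values allowed. The sketch below replays both filtering steps on a made-up four-row survey (survey_Toy and its column names are placeholders) and computes the threshold from the row count instead of typing it in by hand.

```python
import math

import numpy as np
import pandas as pd

survey_Toy = pd.DataFrame({
    "complete":       [1, 2, 3, 4],
    "half_missing":   [1, 2, np.nan, np.nan],
    "mostly_missing": [1, np.nan, np.nan, np.nan],
})

# Minimum number of answers a question needs to be kept (at least half the rows)
minimum_Answers = math.ceil(len(survey_Toy) * 0.5)

# Step 1: drop the questions with too few answers
columns_Kept = survey_Toy.dropna(axis=1, thresh=minimum_Answers)

# Step 2: drop the respondents that still have a missing answer
rows_Kept = columns_Kept.dropna(axis=0, how="any")

print(list(columns_Kept.columns))
print(len(rows_Kept))
```

The same formula applied to our 27821 respondents gives math.ceil(27821 * 0.5), which is 13911, the value used above.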

DROP IRRELEVANT SURVEY QUESTIONS

We have now significantly reduced the total number of columns in our dataframe. However, we are not done yet. As I said earlier, we don't need to know all of these details. Here is a list of columns that we will be deleting because they are unnecessary for the Project. Let's drop them all using the drop() function.

In [ ]:
# Curate a List of the Columns to be Deleted
unnecessary_Columns = [
    "Country code and phase", 
    "Cluster number", 
    "Household number", 
    "Respondent's line number", 
    "Ultimate area unit", 
    "Women's individual sample weight (6 decimals)", 
    "Month of interview", 
    "Year of interview", 
    "Date of interview (CMC)", 
    "Date of interview Century Day Code (CDC)", 
    "Date of birth (CMC)", 
    "Completeness of age information", 
    "Result of individual interview", 
    "Day of interview", 
    "CMC start of calendar", 
    "Row of month of interview", 
    "Length of calendar", 
    "Number of calendar columns",
    "Ever-married sample",
    "Primary sampling unit", 
    "Sample strata for sampling errors", 
    "Stratification used in sample design", 
    "Number of visits", 
    "Interviewer identification", 
    "Field supervisor", 
    "Line number of husband", 
    "Cluster altitude in meters", 
    "Household selected for hemoglobin", 
    "Selected for Domestic Violence module",
    "Years lived in place of residence", 
    "Team number", 
    "Team supervisor"
]

# Drop all the Irrelevant Columns
dataset_Proper_Copy.drop(columns = unnecessary_Columns, inplace = True)
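
A word of caution when rerunning this cell: by default, drop() raises a KeyError if any label in the list is no longer in the dataframe, which happens when the cell is executed a second time. The sketch below, on a made-up dataframe, shows one way to check the list first and how errors="ignore" skips absent labels. Whether silently skipping is desirable depends on how strict you want the notebook to be.

```python
import pandas as pd

survey_Toy = pd.DataFrame({
    "Cluster number": [1, 1],
    "Region": [13, 7],
    "Ever been tested for HIV": [0, 1],
})

# "Team number" is deliberately absent from the toy dataframe
columns_To_Drop = ["Cluster number", "Team number"]

# Labels in the list that the dataframe does not contain
missing_Labels = [col for col in columns_To_Drop if col not in survey_Toy.columns]

# errors="ignore" skips labels that are not present instead of raising a KeyError
trimmed_Toy = survey_Toy.drop(columns=columns_To_Drop, errors="ignore")

print(missing_Labels)
print(list(trimmed_Toy.columns))
```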

REMOVE DUPLICATES AND RESET INDICES

Next, let's do the following:

  1. Remove all Duplicate Columns.
  2. Reset the Row Indices.

In [ ]:
# Remove all Duplicate Columns (keep the first occurrence of each column name)
dataset_Proper_Copy = dataset_Proper_Copy.loc[:, ~dataset_Proper_Copy.columns.duplicated()].copy()

# Reset the Row Indices (they have gaps after dropping the rows with missing values)
dataset_Proper_Copy.reset_index(drop = True, inplace = True)

# Let us view our changes
dataset_Proper_Copy.head()
Out[ ]:
Respondent's month of birth Respondent's year of birth ... Knowledge and use of HIV test kits Children with HIV should be allowed to attend school with children without HIV
0 8 1977 ... 0 8
1 9 2000 ... 0 8
2 7 1993 ... 0 1
3 7 1980 ... 0 0
4 5 1982 ... 0 8

5 rows × 33 columns
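
To see what these two steps do, here is a small sketch on a made-up dataframe (duplicated_Toy is a placeholder). Two columns share the name "Region", just like in our survey data, and the row labels have gaps like the ones left behind after rows are dropped.

```python
import pandas as pd

# Two columns share the name "Region"; the row labels 0, 5, 9 have gaps
duplicated_Toy = pd.DataFrame(
    [[13, 1, 13], [7, 2, 7], [4, 1, 4]],
    columns=["Region", "Type of place of residence", "Region"],
    index=[0, 5, 9],
)

# duplicated() flags the second and later occurrences of a column name
duplicate_Mask = duplicated_Toy.columns.duplicated()
deduplicated_Toy = duplicated_Toy.loc[:, ~duplicate_Mask].copy()

# reset_index() renumbers the row labels; the columns are left untouched
deduplicated_Toy.reset_index(drop=True, inplace=True)

print(list(duplicate_Mask))
print(list(deduplicated_Toy.columns))
print(list(deduplicated_Toy.index))
```

Note that duplicated() only compares column names, so this approach assumes that columns with the same name also hold the same data.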

CONVERT DATATYPES

Finally, it's time to deal with the Datatypes. Ideally, we would want to work with a single data type all throughout the entire dataframe. More specifically, since our Project focuses on finding a numerical relationship between certain Features and overall HIV Perception, we would like to work with Integers.

However, since this is a Survey Questionnaire, I suspect that every single String entry here can actually be encoded as an Integer value. Let's test that hunch. What exactly do all the Object Columns really contain?

In [ ]:
# Find the actual Values each Object Column contains
for col in dataset_Proper_Copy.columns:
    if dataset_Proper_Copy[col].dtype == "object":
        display(dataset_Proper_Copy[col].value_counts())
Highest year of education
4     4694
6     3347
2     1412
3     1110
1     1004
5      429
8       20
98       6
7        4
Name: count, dtype: int64
Condom used during last sex with most recent partner
0    11737
1      289
Name: count, dtype: int64
Relationship with most recent sex partner
1    8508
7    3506
2      10
4       2
Name: count, dtype: int64
Used a method at last sexual intercourse
1    7071
0    4955
Name: count, dtype: int64
Drugs to avoid HIV transmission to baby during pregnancy
1    6465
0    3537
8    2024
Name: count, dtype: int64
Would buy vegetables from vendor with HIV
0    6965
1    4345
8     716
Name: count, dtype: int64
Total lifetime number of sex partners
1     9481
2     1953
3      405
4       91
5       54
98      10
6        9
8        7
95       4
10       4
7        3
15       3
13       1
20       1
Name: count, dtype: int64
Heard of ARVs to treat HIV
0    8007
1    4019
Name: count, dtype: int64
Respondent can refuse sex
1    11003
0      896
8      127
Name: count, dtype: int64
Respondent can ask partner to use a condom
1    9073
0    2473
8     480
Name: count, dtype: int64
Knowledge and use of HIV test kits
0    9635
2    2303
1      88
Name: count, dtype: int64
Children with HIV should be allowed to attend school with children without HIV
0    6614
1    4571
8     841
Name: count, dtype: int64

My hunch was correct. These are Strings that can easily be converted into actual Integer values. Using the astype() function, let's change the data type of all the Object Columns into 64-Bit Integers.

In [ ]:
# Convert each Object Column into an Integer Column
for col in dataset_Proper_Copy.columns:
    if dataset_Proper_Copy[col].dtype == "object":
        dataset_Proper_Copy[col] = dataset_Proper_Copy[col].astype(np.int64)

# Display the Newly Filtered Dataframe
dataFrame_Print_Missing_Info(dataset_Proper_Copy)
DataType TotalCount NullCount NullPercent
Respondent's month of birth int64 12026 0 0.0
Respondent's year of birth int64 12026 0 0.0
Respondent's current age int64 12026 0 0.0
Age in 5-year groups int64 12026 0 0.0
Region int64 12026 0 0.0
Type of place of residence int64 12026 0 0.0
Language of questionnaire int64 12026 0 0.0
Language of interview int64 12026 0 0.0
Native language of respondent int64 12026 0 0.0
Translator used int64 12026 0 0.0
Highest educational level int64 12026 0 0.0
Highest year of education int64 12026 0 0.0
Ever heard of a Sexually Transmitted Infection (STI) int64 12026 0 0.0
Ever heard of AIDS int64 12026 0 0.0
Condom used during last sex with most recent partner int64 12026 0 0.0
Had any STI in last 12 months int64 12026 0 0.0
Had genital sore/ulcer in last 12 months int64 12026 0 0.0
Had genital discharge in last 12 months int64 12026 0 0.0
Number of sex partners, excluding spouse, in last 12 months int64 12026 0 0.0
Number of sex partners, including spouse, in last 12 months int64 12026 0 0.0
Relationship with most recent sex partner int64 12026 0 0.0
Ever been tested for HIV int64 12026 0 0.0
Heard about other STIs int64 12026 0 0.0
Used a method at last sexual intercourse int64 12026 0 0.0
Wife justified asking husband to use condom if he has STI int64 12026 0 0.0
Drugs to avoid HIV transmission to baby during pregnancy int64 12026 0 0.0
Would buy vegetables from vendor with HIV int64 12026 0 0.0
Total lifetime number of sex partners int64 12026 0 0.0
Heard of ARVs to treat HIV int64 12026 0 0.0
Respondent can refuse sex int64 12026 0 0.0
Respondent can ask partner to use a condom int64 12026 0 0.0
Knowledge and use of HIV test kits int64 12026 0 0.0
Children with HIV should be allowed to attend school with children without HIV int64 12026 0 0.0
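
Before casting, it can be worth confirming that every non-numeric column really holds only number-like Strings. Here is a minimal sketch of that check on a toy dataframe. The values are made up for illustration, and `toy`, `text_cols`, and `unconvertible` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the survey dataframe (made-up values, not the real data)
toy = pd.DataFrame({
    "Heard of ARVs to treat HIV": ["0", "1", "1"],
    "Respondent can refuse sex": ["1", "8", "0"],
    "Region": [1, 13, 17],
})

# Pick every column that is not already numeric
text_cols = toy.select_dtypes(exclude="number").columns

# errors="coerce" turns anything that is not a number into NaN,
# so a NaN count of zero means the column can be cast safely
unconvertible = toy[text_cols].apply(pd.to_numeric, errors="coerce").isna().sum()
print(unconvertible)

# Cast only when nothing would be lost
if (unconvertible == 0).all():
    toy = toy.astype({col: "int64" for col in text_cols})

print(toy.dtypes)
```

If any count came back non-zero, that column would need a closer look before calling astype().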

NUMERICAL SUMMARY OF THE DATAFRAME

We have now finished the actual cleaning part of the Project. From here on out, up until the Data Exploration Section, we will be focusing on enhancing the Proper Dataframe more than anything else.

To start, let's take a step back and observe what we have done. Let's determine the high-level, numerical summary of the data that we are dealing with so far.

In [ ]:
def dataFrame_Print_Numerical_Description(dataframe: pd.DataFrame):
    # Create a new Dataframe to hold our Numerical Description
    description_dataFrame = pd.DataFrame(index = dataframe.columns)

    # Determine the Min, Mean, Median, Mode, and Max Values for each Column
    for col in dataframe.columns:
        description_dataFrame.loc[col, "MIN"] = dataframe[col].min().round(2)
        description_dataFrame.loc[col, "MEAN"] = dataframe[col].mean().round(2)
        description_dataFrame.loc[col, "MEDIAN"] = dataframe[col].median().round(2)
        description_dataFrame.loc[col, "MODE"] = dataframe[col].mode().round(2).iloc[0]
        description_dataFrame.loc[col, "MAX"] = dataframe[col].max().round(2)
    
    # Display a Numerical Description of the Dataframe
    display(description_dataFrame)

dataFrame_Print_Numerical_Description(dataset_Proper_Copy)
MIN MEAN MEDIAN MODE MAX
Respondent's month of birth 1.0 6.66 7.0 12.0 12.0
Respondent's year of birth 1972.0 1985.55 1985.0 1979.0 2007.0
Respondent's current age 15.0 35.87 36.0 42.0 49.0
Age in 5-year groups 1.0 4.76 5.0 6.0 7.0
Region 1.0 9.09 9.0 13.0 17.0
Type of place of residence 1.0 1.60 2.0 2.0 2.0
Language of questionnaire 1.0 3.94 2.0 2.0 7.0
Language of interview 1.0 8.08 3.0 2.0 96.0
Native language of respondent 1.0 16.87 6.0 2.0 96.0
Translator used 0.0 0.08 0.0 0.0 1.0
Highest educational level 1.0 2.22 2.0 2.0 3.0
Highest year of education 1.0 4.07 4.0 4.0 98.0
Ever heard of a Sexually Transmitted Infection (STI) 1.0 1.00 1.0 1.0 1.0
Ever heard of AIDS 1.0 1.00 1.0 1.0 1.0
Condom used during last sex with most recent partner 0.0 0.02 0.0 0.0 1.0
Had any STI in last 12 months 0.0 0.01 0.0 0.0 8.0
Had genital sore/ulcer in last 12 months 0.0 0.05 0.0 0.0 8.0
Had genital discharge in last 12 months 0.0 0.08 0.0 0.0 8.0
Number of sex partners, excluding spouse, in last 12 months 0.0 0.00 0.0 0.0 2.0
Number of sex partners, including spouse, in last 12 months 1.0 1.03 1.0 1.0 98.0
Relationship with most recent sex partner 1.0 2.75 1.0 1.0 7.0
Ever been tested for HIV 0.0 0.12 0.0 0.0 1.0
Heard about other STIs 0.0 0.37 0.0 0.0 1.0
Used a method at last sexual intercourse 0.0 0.59 1.0 1.0 1.0
Wife justified asking husband to use condom if he has STI 0.0 0.98 1.0 1.0 8.0
Drugs to avoid HIV transmission to baby during pregnancy 0.0 1.88 1.0 1.0 8.0
Would buy vegetables from vendor with HIV 0.0 0.84 0.0 0.0 8.0
Total lifetime number of sex partners 1.0 1.40 1.0 1.0 98.0
Heard of ARVs to treat HIV 0.0 0.33 0.0 0.0 1.0
Respondent can refuse sex 0.0 1.00 1.0 1.0 8.0
Respondent can ask partner to use a condom 0.0 1.07 1.0 1.0 8.0
Knowledge and use of HIV test kits 0.0 0.39 0.0 0.0 2.0
Children with HIV should be allowed to attend school with children without HIV 0.0 0.94 0.0 0.0 8.0
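
As a side note, the same kind of summary can probably be built without looping over the columns, since Pandas can compute each statistic for all columns at once. This is only a sketch on a toy dataframe with made-up values; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (made-up values)
toy = pd.DataFrame({
    "Ever been tested for HIV": [0, 0, 1, 0],
    "Respondent can refuse sex": [1, 1, 8, 0],
})

# Each statistic is computed column-wise in a single call;
# mode() can return several rows, so we keep the first (smallest) mode
summary = pd.DataFrame({
    "MIN": toy.min(),
    "MEAN": toy.mean().round(2),
    "MEDIAN": toy.median(),
    "MODE": toy.mode().iloc[0],
    "MAX": toy.max(),
})

print(summary)
```

Notice how a code such as 8 pulls the mean of a YES/NO column above 1. The same effect is visible in the real table above, which is one reason the extra codes are worth handling.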

THESE ARE ALRIGHT... BUT WE CAN DO MORE.

Technically speaking, we can already start analyzing the data as it already is. However, there are a few more things I would like to modify here before wrapping everything up:

  1. First of all, while it is true that all of these Survey Questions would be useful to us in one way or another, I think we ought to keep only the columns that are most relevant to the Research Questions. For instance, while it is great to know the month and year of a respondent's birth, and even her exact age, I think knowing her 5-year age group would be more than enough information for us.

  2. Second, while most of these Survey Questions have either a "YES" or a "NO" answer to them, there are a few items that allow for either an "I DON'T KNOW" answer or different variations of "YES" and "NO". To keep it clean and simple, and since we are talking about HIV Perception anyway, let's make all "I DON'T KNOW" answers count as 0's and all remaining choices count as 1's. Also, some columns need to be flipped as saying "YES" could mean a bad thing and vice versa.

After dropping the irrelevant columns, let's split the Proper Dataframe into six sub-components: Age, Region, Residence, Language, Educational Level, and Perception. Afterwards, let's go on and finish this Section from here.

In [ ]:
# Curate a List of the Additional Columns to be Deleted
unnecessary_Columns = [
    "Respondent's current age",
    "Respondent's month of birth", 
    "Respondent's year of birth",
    "Language of questionnaire",
    "Language of interview",
    "Translator used",
    "Highest year of education",
    "Number of sex partners, excluding spouse, in last 12 months", 
    "Number of sex partners, including spouse, in last 12 months", 
    "Relationship with most recent sex partner", 
    "Total lifetime number of sex partners"
]

# Drop all the Irrelevant Columns
dataset_Proper_Copy.drop(columns = unnecessary_Columns, inplace = True)

# Split the Dataframe into the Six Sub-Components
age_dataFrame = dataset_Proper_Copy["Age in 5-year groups"]
region_dataFrame = dataset_Proper_Copy["Region"]
residence_dataFrame = dataset_Proper_Copy["Type of place of residence"]
language_dataFrame = dataset_Proper_Copy["Native language of respondent"]

educLevel_dataFrame = dataset_Proper_Copy["Highest educational level"]

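# Take an explicit copy so that the in-place edits below do not act on a slice of the original
perception_dataFrame = dataset_Proper_Copy.iloc[:, 5:22].copy()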

# Replace the Extra Perception Dataframe Values
perception_dataFrame.replace(8, 0, inplace = True)
perception_dataFrame.replace(2, 1, inplace = True)

# Switch the Values of these Columns
cols_to_flip = ["Had any STI in last 12 months", "Had genital sore/ulcer in last 12 months", "Had genital discharge in last 12 months"]
perception_dataFrame[cols_to_flip] = perception_dataFrame[cols_to_flip].map(lambda x: 0 if x == 1 else 1)

# Concatenate all Six Dataframes back into just One
list_of_sixDataFrames = [age_dataFrame, region_dataFrame, residence_dataFrame, language_dataFrame, educLevel_dataFrame, perception_dataFrame]
merged_dataFrame = pd.concat(list_of_sixDataFrames, axis = 1)

# Display the Numerical Description of the new Dataframe
dataFrame_Print_Numerical_Description(merged_dataFrame)
MIN MEAN MEDIAN MODE MAX
Age in 5-year groups 1.0 4.76 5.0 6.0 7.0
Region 1.0 9.09 9.0 13.0 17.0
Type of place of residence 1.0 1.60 2.0 2.0 2.0
Native language of respondent 1.0 16.87 6.0 2.0 96.0
Highest educational level 1.0 2.22 2.0 2.0 3.0
Ever heard of a Sexually Transmitted Infection (STI) 1.0 1.00 1.0 1.0 1.0
Ever heard of AIDS 1.0 1.00 1.0 1.0 1.0
Condom used during last sex with most recent partner 0.0 0.02 0.0 0.0 1.0
Had any STI in last 12 months 0.0 0.99 1.0 1.0 1.0
Had genital sore/ulcer in last 12 months 0.0 0.97 1.0 1.0 1.0
Had genital discharge in last 12 months 0.0 0.94 1.0 1.0 1.0
Ever been tested for HIV 0.0 0.12 0.0 0.0 1.0
Heard about other STIs 0.0 0.37 0.0 0.0 1.0
Used a method at last sexual intercourse 0.0 0.59 1.0 1.0 1.0
Wife justified asking husband to use condom if he has STI 0.0 0.78 1.0 1.0 1.0
Drugs to avoid HIV transmission to baby during pregnancy 0.0 0.54 1.0 1.0 1.0
Would buy vegetables from vendor with HIV 0.0 0.36 0.0 0.0 1.0
Heard of ARVs to treat HIV 0.0 0.33 0.0 0.0 1.0
Respondent can refuse sex 0.0 0.91 1.0 1.0 1.0
Respondent can ask partner to use a condom 0.0 0.75 1.0 1.0 1.0
Knowledge and use of HIV test kits 0.0 0.20 0.0 0.0 1.0
Children with HIV should be allowed to attend school with children without HIV 0.0 0.38 0.0 0.0 1.0
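
To make the recoding rule concrete, here is a small worked example on a toy dataframe. The values are made up, and `toy` and `recoded` are hypothetical names. For columns that only hold 0's and 1's, subtracting from 1 flips them the same way the lambda in the cell above does.

```python
import pandas as pd

# Toy stand-in for three perception columns (made-up answers)
toy = pd.DataFrame({
    "Would buy vegetables from vendor with HIV": [0, 1, 8],
    "Knowledge and use of HIV test kits": [0, 2, 1],
    "Had any STI in last 12 months": [0, 1, 8],
})

# "I DON'T KNOW" (8) becomes 0, and the extra "YES" variant (2) becomes 1
recoded = toy.replace({8: 0, 2: 1})

# Flip the columns where "YES" is the unfavorable answer (0 <-> 1)
cols_to_flip = ["Had any STI in last 12 months"]
recoded[cols_to_flip] = 1 - recoded[cols_to_flip]

print(recoded)
```

One thing to keep in mind: because the 8's are turned into 0's before the flip, an "I DON'T KNOW" in a flipped column ends up as a 1.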

III. DATA EXPLORATION

DETERMINE THE OVERALL PERCEPTION MEAN

Now, let's finally start exploring and analyzing our data. There are two things I would like to do first:

  1. Let's rename some of these columns so that they become more concise and easier to read.
  2. Let's try to quantify the "Perception" of a Respondent.

There are different ways of measuring perception. One way is to simply take the Average across all the Survey Questions. This is a quick and easy approach but, of course, there are other ways too (such as a Weighted Average). For now, let's do these two things and see what we get.

In [ ]:
# Create a Dictionary of the Column Names to be Replaced
rename_toLessWords_Dict = {
    "Ever heard of a Sexually Transmitted Infection (STI)": "Ever heard of a STI",
    "Condom used during last sex with most recent partner": "Condom used during last sex",
    "Had any STI in last 12 months": "Had any STI in last year",
    "Had genital sore/ulcer in last 12 months": "Had genital sore/ulcer last year",
    "Had genital discharge in last 12 months": "Had genital discharge last year",
    "Used a method at last sexual intercourse": "Used a method last sex",
    "Wife justified asking husband to use condom if he has STI": "Justified asking husband with STI to use condom",
    "Drugs to avoid HIV transmission to baby during pregnancy": "Knows Anti-HIV transmission to baby Drugs",
    "Would buy vegetables from vendor with HIV": "Would buy vegetables from vendor with HIV",
    "Heard of ARVs to treat HIV": "Heard of ARVs",
    "Respondent can ask partner to use a condom": "Can ask to use condom",
    "Knowledge and use of HIV test kits": "Knowledge of HIV test kits",
    "Children with HIV should be allowed to attend school with children without HIV": "Kids w/ HIV allowed schooling w/ Kids w/o HIV"
}

# Rename the Specified Columns
merged_dataFrame.rename(columns = rename_toLessWords_Dict, inplace = True)

# Create a new Column for the Average of the Survey Question Values
merged_dataFrame["Perception Mean"] = perception_dataFrame.mean(axis = 1)
percMean_dataFrame = merged_dataFrame["Perception Mean"]

# Display the Numerical Description of the new Dataframe
dataFrame_Print_Numerical_Description(merged_dataFrame)
MIN MEAN MEDIAN MODE MAX
Age in 5-year groups 1.00 4.76 5.00 6.00 7.0
Region 1.00 9.09 9.00 13.00 17.0
Type of place of residence 1.00 1.60 2.00 2.00 2.0
Native language of respondent 1.00 16.87 6.00 2.00 96.0
Highest educational level 1.00 2.22 2.00 2.00 3.0
Ever heard of a STI 1.00 1.00 1.00 1.00 1.0
Ever heard of AIDS 1.00 1.00 1.00 1.00 1.0
Condom used during last sex 0.00 0.02 0.00 0.00 1.0
Had any STI in last year 0.00 0.99 1.00 1.00 1.0
Had genital sore/ulcer last year 0.00 0.97 1.00 1.00 1.0
Had genital discharge last year 0.00 0.94 1.00 1.00 1.0
Ever been tested for HIV 0.00 0.12 0.00 0.00 1.0
Heard about other STIs 0.00 0.37 0.00 0.00 1.0
Used a method last sex 0.00 0.59 1.00 1.00 1.0
Justified asking husband with STI to use condom 0.00 0.78 1.00 1.00 1.0
Knows Anti-HIV transmission to baby Drugs 0.00 0.54 1.00 1.00 1.0
Would buy vegetables from vendor with HIV 0.00 0.36 0.00 0.00 1.0
Heard of ARVs 0.00 0.33 0.00 0.00 1.0
Respondent can refuse sex 0.00 0.91 1.00 1.00 1.0
Can ask to use condom 0.00 0.75 1.00 1.00 1.0
Knowledge of HIV test kits 0.00 0.20 0.00 0.00 1.0
Kids w/ HIV allowed schooling w/ Kids w/o HIV 0.00 0.38 0.00 0.00 1.0
Perception Mean 0.18 0.60 0.59 0.59 1.0
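
As an alternative to the plain Average used in this Section, a Weighted Average would let some Survey Questions count more than others. Below is a minimal sketch of the idea. The weights are purely hypothetical (this project does not define any), and the toy answers are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for three binary perception columns (made-up answers)
toy = pd.DataFrame({
    "Ever been tested for HIV": [0, 1, 1],
    "Heard of ARVs": [1, 1, 0],
    "Can ask to use condom": [1, 0, 1],
})

# Hypothetical weights: here the first question counts twice as much
weights = pd.Series({
    "Ever been tested for HIV": 2.0,
    "Heard of ARVs": 1.0,
    "Can ask to use condom": 1.0,
})

# Plain row-wise average, as used in this notebook
plain_mean = toy.mean(axis=1)

# Row-wise weighted average: sum(answer * weight) / sum(weights)
weighted_mean = pd.Series(
    np.average(toy[weights.index].to_numpy(), axis=1, weights=weights.to_numpy()),
    index=toy.index,
)

print(pd.DataFrame({"PLAIN": plain_mean, "WEIGHTED": weighted_mean}))
```

With equal weights the two columns would match, so the plain Average is just the special case where every question counts the same.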

FIND THE PERCEPTION MEAN GROUPED BY FEATURE

Let's see how Age, Region, Residence, Language, and Educational Level relate to the Perception Mean. The call below uses Educational Level; the other features can be passed in the same way.

In [ ]:
# Curate the List of Features
list_of_featureStrings = ["Age in 5-year groups", "Region", "Type of place of residence", "Native language of respondent", "Highest educational level"]

# Concatenate all the Dataframe Groups into One
list_of_groupDataFrames = [age_dataFrame, region_dataFrame, residence_dataFrame, language_dataFrame, educLevel_dataFrame, percMean_dataFrame]
merged_groups_dataFrame = pd.concat(list_of_groupDataFrames, axis = 1)

def dataFrame_Print_meanPerception_byGroup(dataframe: pd.DataFrame, feature: str, targetstring: str):
    
    # Determine the Perception Mean (Count, Min, and Max included) for the Feature
    meanPerception_dataFrame = dataframe.groupby(feature)[targetstring].agg([
        ("PERCEPTION COUNT", "count"),
        ("PERCEPTION MEAN", lambda x: round(x.mean(), 4)),
        ("PERCEPTION MIN", lambda x: round(x.min(), 4)),
        ("PERCEPTION MAX", lambda x: round(x.max(), 4))
    ])

    # Display the Feature Group
    display(meanPerception_dataFrame)

dataFrame_Print_meanPerception_byGroup(merged_groups_dataFrame, "Highest educational level", "Perception Mean")
                           PERCEPTION COUNT  PERCEPTION MEAN  PERCEPTION MIN  PERCEPTION MAX
Highest educational level
1                                      1807           0.5495          0.1765          0.8824
2                                      5825           0.5967          0.2353          1.0000
3                                      4394           0.6366          0.2353          1.0000
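
The same grouping can be repeated for every feature with a simple loop. Here is a minimal sketch on a toy dataframe. It uses Pandas' named aggregation in place of the lambdas above, and the values, `toy`, `features`, and `summaries` are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for the merged groups dataframe (made-up values)
toy = pd.DataFrame({
    "Type of place of residence": [1, 1, 2, 2],
    "Highest educational level": [1, 2, 2, 3],
    "Perception Mean": [0.50, 0.60, 0.70, 0.80],
})

features = ["Type of place of residence", "Highest educational level"]

# One summary table per feature, stored by feature name
summaries = {}
for feature in features:
    summaries[feature] = (
        toy.groupby(feature)["Perception Mean"]
        .agg(count="count", mean="mean", minimum="min", maximum="max")
        .round(4)
    )
    print(summaries[feature])
```

Keeping the tables in a dictionary makes it easy to pull out any single feature later, for example when preparing a plot.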

IV. DATA VISUALIZATION

GRAPH THE PERCEPTION MEAN BOX PLOT

Let's visualize the Perception Mean Distribution by Educational Levels using a Box Plot.

In [ ]:
# Define the Label Name Mapping
educLevel_group_mapping = {
    1: "Primary",
    2: "Secondary",
    3: "Higher"
}

merged_groups_dataFrame["Highest Educational Level"] = merged_groups_dataFrame["Highest educational level"].map(educLevel_group_mapping)
educLevel_groups_order = ["Primary", "Secondary", "Higher"]

def add_box_traces(fig, df, educLevel_group_column, perception_mean_column, educLevel_groups):
    # Add a box trace for each educational level group to the figure
    for educLevel_group in educLevel_groups:
        fig.add_trace(go.Box(
            y = df[df[educLevel_group_column] == educLevel_group][perception_mean_column],
            name = str(educLevel_group),
            boxpoints = "outliers",
            jitter = 0.5,
            whiskerwidth = 0.2,
            marker = dict(size = 2),
            line = dict(width = 1),
        ))
    return fig

def add_vertical_line(fig):
    # Add a horizontal dashed reference line across the middle of the plot (y = 0.5 in paper coordinates)
    fig.add_shape(
        type = "line",
        x0 = -1,
        y0 = 0.5,
        x1 = 3,
        y1 = 0.5,
        yref = "paper",
        line = dict(color="White", width = 2, dash = "dash"),
        opacity = 0.5
    )
    return fig

def update_axes(fig, educLevel_group_column, perception_mean_column):
    # Update the x and y axes of the figure
    fig.update_xaxes(title_text = educLevel_group_column, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", title_standoff = 25)
    fig.update_yaxes(title_text = perception_mean_column, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", range = [0, 1], title_standoff = 25)
    return fig

def update_layout(fig):
    # Update the layout of the figure
    fig.update_layout(
        title_text = "<b>PERCEPTION MEAN BOX PLOT</b>",
        title_font_family = "Roboto",
        title_font_size = 40,
        title_x = 0.5,
        font_family = "Roboto",
        font_size = 20,
        width = 1000,
        height = 500,
        template = "plotly_dark",
        margin = dict(l = 200, r = 250, t = 200, b = 150)
    )
    return fig

def create_box_plot(df, educLevel_group_column, perception_mean_column, educLevel_groups):
    # Create a Box Plot with the given DataFrame and Columns
    fig = go.Figure()
    fig = add_box_traces(fig, df, educLevel_group_column, perception_mean_column, educLevel_groups)
    fig = add_vertical_line(fig)
    fig = update_axes(fig, educLevel_group_column, perception_mean_column)
    fig = update_layout(fig)
    fig.show()

# Display the figure
create_box_plot(merged_groups_dataFrame, "Highest Educational Level", "Perception Mean", educLevel_groups_order)
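
If we want the numbers behind each box, the quartiles and the usual 1.5 x IQR fences can be computed directly with Pandas. This is a rough sketch on toy values, with `toy` and `quartiles` as hypothetical names. Plotly may use a slightly different interpolation rule for its quartiles, so the figures drawn on the plot are not guaranteed to match these exactly.

```python
import pandas as pd

# Toy stand-in for the merged groups dataframe (made-up values)
toy = pd.DataFrame({
    "Highest Educational Level": ["Primary"] * 4 + ["Higher"] * 4,
    "Perception Mean": [0.2, 0.4, 0.5, 0.6, 0.5, 0.6, 0.7, 0.9],
})

# Quartiles per group; unstack() turns the quantile level into columns
quartiles = (
    toy.groupby("Highest Educational Level")["Perception Mean"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["Q1", "MEDIAN", "Q3"]

# Points beyond the fences are the ones usually drawn as outliers
iqr = quartiles["Q3"] - quartiles["Q1"]
quartiles["LOWER FENCE"] = quartiles["Q1"] - 1.5 * iqr
quartiles["UPPER FENCE"] = quartiles["Q3"] + 1.5 * iqr

print(quartiles.round(4))
```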

GRAPH THE PERCEPTION MEAN NORMAL DISTRIBUTION

Let's fit a Normal Distribution to the Perception Mean of each Educational Level and visualize the resulting curves.

In [ ]:
# Define the Label Name Mapping
educLevel_group_mapping = {
    1: " Primary",
    2: " Secondary",
    3: " Higher"
}

merged_groups_dataFrame["Highest Educational Level"] = merged_groups_dataFrame["Highest educational level"].map(educLevel_group_mapping)
educLevel_groups_order = [" Primary", " Secondary", " Higher"]

def add_scatter_traces(fig, df, group_column, value_column, group_order):
    # Add a scatter trace for each group to the figure
    for group in group_order:
        group_data = df[df[group_column] == group][value_column]
        mu, std = sp.stats.norm.fit(group_data)
        x = np.linspace(min(group_data), max(group_data), 100)
        p = sp.stats.norm.pdf(x, mu, std)
        fig.add_trace(go.Scatter(x=x, y=p, mode='lines', name=str(group)))
    return fig

def add_vertical_line(fig):
    # Add a vertical dashed reference line at x = 0.5, the midpoint of the Perception Mean scale
    fig.add_shape(
        type = "line",
        x0 = 0.5,
        y0 = 0,
        x1 = 0.5,
        y1 = 1,
        yref = "paper",
        line = dict(color = "White", width = 2, dash = "dash"),
        opacity = 0.5
    )
    return fig

def update_axes(fig, x_title, y_title):
    # Update the x and y axes of the figure
    fig.update_xaxes(title_text = x_title, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", range = [0, 1], title_standoff = 25)
    fig.update_yaxes(title_text = y_title, showline = True, linewidth = 2, linecolor = "White", ticksuffix = " ", range = [0, 4], title_standoff = 25)
    return fig

def update_layout(fig):
    # Update the layout of the figure
    fig.update_layout(
        title_text = "<b>PERCEPTION MEAN NORMAL DISTRIBUTION</b>",
        title_font_family = "Roboto",
        title_font_size = 40,
        title_x = 0.5,
        font_family = "Roboto",
        font_size = 20,
        width = 1000,
        height = 500,
        template = "plotly_dark",
        margin = dict(l = 200, r = 250, t = 200, b = 150)
    )
    return fig

def create_grouped_scatter_plot(df, group_column, value_column, group_order):
    # Create a grouped scatter plot with the given DataFrame and columns
    fig = go.Figure()
    fig = add_scatter_traces(fig, df, group_column, value_column, group_order)
    fig = add_vertical_line(fig)
    fig = update_axes(fig, "Perception Mean", "Probability Density")
    fig = update_layout(fig)
    fig.show()

# Display the figure
create_grouped_scatter_plot(merged_groups_dataFrame, "Highest Educational Level", "Perception Mean", educLevel_groups_order)
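
To see what the fitting step inside add_scatter_traces() does on its own, here is a minimal sketch on synthetic data. The sample is randomly generated and is not the survey data. For a Normal Distribution, the fitted parameters are simply the sample mean and the population standard deviation. The sketch imports scipy.stats explicitly so that it does not depend on the submodule having been loaded elsewhere.

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for one group's Perception Mean values
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.6, scale=0.1, size=500)

# Maximum-likelihood fit: mu is the sample mean, std is the population standard deviation
mu, std = stats.norm.fit(sample)

# Evaluate the fitted density over the observed range, as the plot does
x = np.linspace(sample.min(), sample.max(), 100)
density = stats.norm.pdf(x, mu, std)

print(f"mu = {mu:.4f}, std = {std:.4f}")
print(f"density peaks near x = {x[np.argmax(density)]:.4f}")
```

Each curve on the plot is therefore a smooth summary of a group's mean and spread rather than a picture of the raw answers.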